Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently, without exerting much mutual influence. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning. The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning model that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improves performance over separate systems that were trained only with labeled data.
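The following is a minimal, illustrative sketch (not the authors' implementation) of how such a closed-loop training step might combine supervised losses on labeled pairs with reconstruction losses on unlabeled speech and unlabeled text. Toy linear layers stand in for the real sequence-to-sequence ASR and TTS models, and all names, dimensions, and loss choices are hypothetical.

```python
# Hypothetical sketch of a closed-loop (speech chain) training step in PyTorch.
# Linear layers are toy stand-ins for the actual seq2seq ASR and TTS models.
import torch
import torch.nn as nn

FEAT_DIM, TXT_DIM = 8, 6  # toy speech-feature / text-embedding dimensions

asr = nn.Linear(FEAT_DIM, TXT_DIM)   # stand-in for the ASR model
tts = nn.Linear(TXT_DIM, FEAT_DIM)   # stand-in for the TTS model
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)

def train_step(labeled, unlabeled_speech, unlabeled_text):
    speech, text = labeled
    # Supervised losses on the labeled speech-text pair.
    loss = nn.functional.mse_loss(asr(speech), text)
    loss = loss + nn.functional.mse_loss(tts(text), speech)
    # Unlabeled speech: ASR produces text, TTS reconstructs the speech.
    loss = loss + nn.functional.mse_loss(tts(asr(unlabeled_speech)), unlabeled_speech)
    # Unlabeled text: TTS synthesizes speech, ASR reconstructs the text.
    loss = loss + nn.functional.mse_loss(asr(tts(unlabeled_text)), unlabeled_text)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for real batches.
labeled = (torch.randn(4, FEAT_DIM), torch.randn(4, TXT_DIM))
print(train_step(labeled, torch.randn(4, FEAT_DIM), torch.randn(4, TXT_DIM)))
```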